DATA1220-55, Fall 2024
2024-09-27
We will only be covering Chapter 4.1 on the normal distribution in your textbook
If you have an interest in math or statistics, you may want to read the rest of Chapter 4
4.2 - Geometric distribution
4.3 - Binomial distribution
4.4 - Negative binomial distribution
4.5 - Poisson distribution
Identify and describe the standard normal and normal distributions
Standardize normal distributions and calculate Z-scores
Calculate percentiles and exact probabilities
Apply the 68-95-99.7 Rule
Read a QQ-Plot (not in book)
Symmetric, unimodal, “bell-shaped”
Not as common as people think in real data
Strong assumption in small sample sizes
Powerful statistical tests available when outcome approximates normal distribution
\(\mu\) (Greek letter mu) represents the mean
\(\sigma\) (Greek letter sigma) represents the standard deviation
\(N(\mu, \sigma)\) stands for a normal distribution with mean \(\mu\) and standard deviation \(\sigma\)
Vocabulary scores for 947 seventh-graders. Both histograms and density curves can be helpful in identifying normal distributions.
Dashed line is self-reported heights by females on OkCupid. Dark purple line is the normal distribution with the same mean and standard deviation. Light purple line is the US average.
Changing the mean shifts the “center” of the distribution. Changing the standard deviation alters the “width” of the distribution (i.e. variability).
A Z-score is the number of standard deviations a value falls above (when positive) or below (when negative) the mean of the data
Z-scores standardize a normal distribution by…
Centering the data at 0 by subtracting the mean from each score
Scaling the units of the data to 1 by dividing the centered data by the standard deviation
\[ \begin{aligned} Z&=\frac{\operatorname{observed value}-\operatorname{mean}}{\operatorname{standard deviation}} \\ &= \frac{x-\mu}{\sigma} \end{aligned} \]
The numerator of the Z-score, \(x-\mu\), calculates how many units an observed value is from the mean of the normal distribution
For a given random variable \(X\) with a normal distribution, you center the data by calculating \(x_{\operatorname{centered}} = x_i-\mu\) for each value of \(X\) such that…
When \(x_i \approx \mu\), \(x_{\operatorname{centered}} \approx 0\)
The units of random variable \(X_{\operatorname{centered}}\) are the same as the units for the original variable \(X\)
Dividing the numerator of the Z-score \(x-\mu\) by the denominator \(\sigma\) converts the units of the centered data to standard deviations
Converts “\(x_{\operatorname{centered}}\) is \(x_i - \mu\) units greater/less than the mean \(\mu\)” to “\(x_{\operatorname{scaled}}\) is \(\frac{x_i - \mu}{\sigma}\) standard deviations greater/less than the mean \(\mu\)”
When \(x_i-\mu \approx \sigma\), \(x_{\operatorname{scaled}} \approx 1\)
For scaled data, \(1 \operatorname{unit} = 1 \operatorname{standard deviation}\)
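The two steps above (center, then scale) can be sketched in a few lines of Python. This is only an illustration: the `scores` list is made-up data, not from the course, and the population standard deviation (`pstdev`) is used here on the assumption that the list is the whole population.

```python
from statistics import mean, pstdev

# Hypothetical scores, invented for illustration only
scores = [1200, 1500, 1800, 1350, 1650]

mu = mean(scores)        # mean of the data
sigma = pstdev(scores)   # population standard deviation of the data

# Step 1: center by subtracting the mean (units are still points)
centered = [x - mu for x in scores]

# Step 2: scale by dividing by the standard deviation (units are now SDs)
z_scores = [c / sigma for c in centered]

print(mean(centered))    # centered data has mean 0 (up to rounding)
print(pstdev(z_scores))  # scaled data has standard deviation 1 (up to rounding)
```

After both steps, a value of 1 in `z_scores` means "1 standard deviation above the mean", regardless of the original units.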
SAT scores are normally distributed with \(\mu=1500\) and \(\sigma=300\) (\(N(\mu=1500, \sigma = 300)\))
ACT scores are normally distributed with \(\mu=21\) and \(\sigma=5\) (\(N(21, 5)\))
How do we compare normal distributions with different locations and scales? Is Pam more above average than Jim? Vice versa?
If both Pam and Jim applied to John Carroll, who would be the better student to admit?
If SAT scores have the distribution \(N(\mu=1500, \sigma=300)\) and Pam’s SAT score is 1800, then Pam’s Z-score is…
\[ \begin{aligned} \operatorname{Z-Score}&=\frac{x-\mu}{\sigma} \\ &= \frac{1800-1500}{300} \\ &= 1 \end{aligned} \]
Pam’s SAT Z-score is 1, so Pam’s SAT score is 1 standard deviation greater than the mean.
If ACT scores have the distribution \(N(\mu=21, \sigma=5)\) and Jim’s ACT score is 24, then Jim’s Z-score is…
\[ \begin{aligned} \operatorname{Z-Score}&=\frac{x-\mu}{\sigma} \\ &= \frac{24-21}{5} \\ &= 0.6 \end{aligned} \]
Jim’s ACT Z-score is 0.6, so Jim’s ACT score is 0.6 standard deviations greater than the mean.
Pam’s SAT Z-Score is 1
Jim’s ACT Z-Score is 0.6
Pam’s SAT score is more above average than Jim’s ACT score
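The comparison above can be checked with a small helper. The function name `z_score` is just an illustrative choice; the means, standard deviations, and scores are the ones given for Pam and Jim.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

pam_z = z_score(1800, mu=1500, sigma=300)  # SAT scores: N(1500, 300)
jim_z = z_score(24, mu=21, sigma=5)        # ACT scores: N(21, 5)

print(pam_z)          # 1.0
print(jim_z)          # 0.6
print(pam_z > jim_z)  # True
```

Because both scores are now in units of standard deviations, they can be compared directly even though the SAT and ACT use different scales.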
The standard normal distribution is a normal distribution with \(\mu=0\) (centered) and \(\sigma=1\) (scaled)
The standard normal distribution is written \(N(\mu=0, \sigma=1)\)
Units of the standard normal distribution are standard deviations (Z-scores) (i.e. 1 unit = 1 SD)
Observations that are 2+ standard deviations from the mean are considered unusual
When data is (nearly) normally distributed…
~68% of the observations are within 1 standard deviation of the mean (\(\mu \pm \sigma\))
~95% of the observations are within 2 standard deviations of the mean (\(\mu \pm 2\sigma\))
~99.7% of the observations are within 3 standard deviations of the mean (\(\mu \pm 3\sigma\))
The 68-95-99.7 Rule describes approximately what proportion of the observations should lie within 1, 2, and 3 standard deviations of the mean respectively, if the data is normally distributed
SAT scores have the distribution \(N(1500, 300)\)
~68% of scores will be 1200-1800
~95% of scores will be 900-2100
~99.7% of scores will be 600-2400
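As a rough check of the rule, the proportions and the SAT ranges can be computed with `statistics.NormalDist` from the Python standard library. This is a sketch under the same assumption as the example, namely that SAT scores follow \(N(1500, 300)\).

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)         # standard normal distribution
sat = NormalDist(mu=1500, sigma=300)  # SAT scores, as assumed in the example

rule = {}
for k in (1, 2, 3):
    # Proportion of observations within k standard deviations of the mean
    within = z.cdf(k) - z.cdf(-k)
    low = sat.mean - k * sat.stdev
    high = sat.mean + k * sat.stdev
    rule[k] = (within, low, high)
    print(f"within {k} SD: {within:.4f}  SAT range: {low:.0f}-{high:.0f}")
```

The computed proportions are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated as an approximation.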
A percentile is the proportion or percentage of observations that fall at or below a given threshold in a distribution.
\[ \begin{aligned} \operatorname{Percentile}(X=x_i) &= \frac{\operatorname{count}(\operatorname{observations} \le x_i)}{\operatorname{count}(\operatorname{total observations})} \\ &=\operatorname{Proportion}(\operatorname{observations} \le x_i) \\ &=\operatorname{Probability}(\operatorname{any observation} \le x_i) \end{aligned} \]
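The counting definition above translates directly into code. The function name `percentile` and the `scores` list are hypothetical, chosen only to illustrate the formula.

```python
def percentile(observations, threshold):
    """Proportion of observations that are less than or equal to threshold."""
    at_or_below = sum(1 for x in observations if x <= threshold)
    return at_or_below / len(observations)

# Hypothetical scores, invented for illustration only
scores = [900, 1200, 1350, 1500, 1500, 1650, 1800, 2100]

print(percentile(scores, 1500))  # 0.625, since 5 of the 8 scores are <= 1500
```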
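You too can calculate probabilities for continuous numeric variables! For a continuous variable, the probability of any single exact value is 0, so probabilities are areas under the density curve. The normal density gives the height of that curve at \(x_i\):
\[ f(x_i)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \]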
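The normal curve formula above can be evaluated by hand and compared with the built-in `NormalDist.pdf` method. This is a minimal sketch; `normal_density` is an illustrative name, and the SAT parameters are reused from the earlier example. Note that the result is the height of the curve at that point, not a probability.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma) curve at x, computed from the formula."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

by_hand = normal_density(1800, mu=1500, sigma=300)
built_in = NormalDist(mu=1500, sigma=300).pdf(1800)

print(by_hand)
print(built_in)  # should agree with by_hand up to floating-point rounding
```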
You can use a Z-Score Table to look up the percentile that corresponds to a particular Z-Score for a standard normal distribution.
You can use a Z-Score Table to look up the probability that an observed Z-Score is less than or equal to a given Z-Score (i.e. threshold) for a standard normal distribution.
For example, the percentile for a Z-score of 1 is 0.8413447 (about the 84th percentile).
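The value 0.8413447 can be reproduced without a table. The sketch below uses Python's `statistics.NormalDist`, whose `cdf` method returns the probability of a value at or below the threshold; the SAT parameters are those from the earlier example.

```python
from statistics import NormalDist

# Percentile of a Z-score of 1 on the standard normal distribution
p_z = NormalDist(mu=0, sigma=1).cdf(1)
print(round(p_z, 7))  # 0.8413447

# The same percentile, computed on the original SAT scale for a score of 1800
p_sat = NormalDist(mu=1500, sigma=300).cdf(1800)
print(round(p_sat, 7))  # 0.8413447
```

Both calls give the same answer because a score of 1800 on \(N(1500, 300)\) has a Z-score of 1.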
Area under a probability curve = 1 (i.e. sample space)
Probability above a threshold = 1 - percentile of threshold
\(P(X \le x_i)=1-P(X \not \le x_i)\)
\(P(X > x_i)=1-P(X \le x_i)\)
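The complement rule above can be sketched as follows, again assuming SAT scores follow \(N(1500, 300)\) and using a threshold of 1800.

```python
from statistics import NormalDist

sat = NormalDist(mu=1500, sigma=300)

below = sat.cdf(1800)  # P(X <= 1800), the percentile of the threshold
above = 1 - below      # P(X > 1800), by the complement rule

print(round(below, 4))  # 0.8413
print(round(above, 4))  # 0.1587
```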
Sometimes the normal distribution is an acceptable approximation of a discrete numeric variable, but other distributions may be more appropriate.
Quantile-Quantile (QQ) Plots can help easily identify when you can and cannot assume normality.
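A QQ plot pairs each sorted observation with the value a normal distribution would predict at the same position. The sketch below computes those pairs without drawing the plot. The `data` values are invented for illustration, and the plotting positions \((i - 0.5)/n\) are one common convention, assumed here rather than taken from the course.

```python
from statistics import NormalDist

# Hypothetical heights in inches, invented for illustration only
data = sorted([61.5, 63.0, 64.2, 64.8, 65.1, 65.9, 66.4, 67.0, 68.3, 70.1])
n = len(data)

# Normal distribution with the same mean and standard deviation as the data
fitted = NormalDist.from_samples(data)

# Theoretical quantiles at plotting positions (i - 0.5) / n
theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for expected, observed in zip(theoretical, data):
    print(f"theoretical: {expected:6.2f}   observed: {observed:6.2f}")
```

If the data are approximately normal, the observed values track the theoretical ones closely, so the points fall near a straight line when plotted against each other.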
DATA1220-55 Fall 2024, Class 13 | Updated: 2024-09-27